
================== Classification with Krimp ======================


=== 0. Setup & get to know Krimp ===

- Before you continue, it is probably a good idea to get to know the 'basic' functionality
of Krimp first. For this, follow the instructions in 'My First Krimp Experiment.txt'.
- Don't forget to prepare the database(s) you want to use for classification, as described 
in that document.


=== 1. Prepare your database for classification ===

- After preparing your database(s) for Krimp, you may wish to add additional info on the
items that define the class label a transaction has.
- Please note: Krimp currently only supports databases in which each transaction has a
single class label. Each possible label should be represented by a single item. I.e. for
binary classification, two items should be present in the original database that are used
to label each transaction.
E.g., with items 0 and 1 as labels, example transactions could look like:
0 3 5 8 9 10
1 3 4 8 10
1 2 6 7 9 11 12 13
The first transaction belongs to class 0, the other two to class 1.
- The 'class definition' tells Krimp which items to regard as class labels. If a database
has a fixed class definition, it is probably good to add this info to your converted .db
database file.
- Note that item numbers usually change when your original database is converted to
Krimp format. To find the correct translation of the item numbers, use analysedb
(see other document, example given as chess.db.analysis.txt).
- A class definition is given by adding a line to the .db file using a plain text editor.
So, open up the file and add a line to the 'header' (which consists of a series of lines
starting with 2 characters and a colon, like ab: and it:). The class definition should
look like:
cl: 0 1
In this case, items 0 and 1 will be considered class labels. More class labels are no
problem, for example:
cl: 3 6 7 9
- (A class definition in a database can be overridden for particular experiments if 
desired, see later on.)
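
Before adding a class definition, it can help to verify that every transaction really
contains exactly one of the intended class items. The following is a minimal Python
sketch of such a check, not part of Krimp. It assumes the transactions are already
available as lists of item numbers; the function names check_single_class_label and
class_definition_line are hypothetical helpers introduced only for this illustration.

```python
def check_single_class_label(transactions, class_items):
    """Return the class label of each transaction.

    Raises ValueError if a transaction does not contain exactly one class item.
    (Hypothetical helper, not part of Krimp.)
    """
    class_items = set(class_items)
    labels = []
    for row_number, transaction in enumerate(transactions, start=1):
        found = class_items.intersection(transaction)
        if len(found) != 1:
            raise ValueError(
                f"transaction {row_number} has {len(found)} class labels, expected 1"
            )
        labels.append(next(iter(found)))
    return labels


def class_definition_line(class_items):
    """Format the header line that declares the class items, e.g. 'cl: 0 1'."""
    return "cl: " + " ".join(str(item) for item in class_items)


# The three example transactions from this section, with items 0 and 1 as labels.
transactions = [
    [0, 3, 5, 8, 9, 10],
    [1, 3, 4, 8, 10],
    [1, 2, 6, 7, 9, 11, 12, 13],
]
labels = check_single_class_label(transactions, class_items=[0, 1])
print(labels)                         # [0, 1, 1]
print(class_definition_line([0, 1]))  # cl: 0 1
```

Remember that the item numbers to use in the cl: line are those of the converted
database, not necessarily those of the original one.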


=== 2. Partitioning databases for cross-validation and compressing them ===

- Open up classifycompress.conf.
- The most important configuration directives are:
	* iscName = chess-all-2500d
	This directive tells Krimp which candidates are to be used. The database name is also
	extracted from this directive if it is not given separately (which is the easiest way to go).

	Format: [dbName]-[itemsetType]-[minSup][candidateOrder]
	
	dbName			- wine, chess, mushroom, ...
	itemsetType		- all, closed
	minSup			- absolute minimum support level
	candidateOrder	- the default Krimp order as described in our papers is 'd'
	
	Note that lower minsups are often possible with classification than with a regular
	Krimp run, as cross-validation is used and the database is split up per class.
	
	* seed = 0
	The number given here is used to seed the random number generator. When set to 0, the
	current time will be used. 
	The random number generator is used to randomize the partitions for cross-validation.

	* #classDefinition = 9 14 22
	If the database contains no class definition, or the one it contains has to be
	overridden (e.g., a different target is chosen for the current experiment), uncomment
	and use this directive. Not used by default.
- Run Krimp with this config file. This may take a while, as many compression runs are
needed. Start with a small database to find out how things work.
- Resulting databases and compression results are stored in Krimp\xps\classify. Each run
gets its own directory, based on configuration and a timestamp.
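
To make the iscName format concrete, here is a small Python sketch that splits a name
such as chess-all-2500d into its four parts. It only illustrates the format described
above and is not Krimp's own parser. The function parse_isc_name and the IscName type
are hypothetical, and the sketch assumes that the itemset type is either 'all' or
'closed' and that the candidate order consists of letters only.

```python
import re
from typing import NamedTuple


class IscName(NamedTuple):
    db_name: str
    itemset_type: str
    min_sup: int
    candidate_order: str


# [dbName]-[itemsetType]-[minSup][candidateOrder]
# Assumption: itemsetType is 'all' or 'closed', candidateOrder is letters only.
_ISC_PATTERN = re.compile(
    r"^(?P<db>.+)-(?P<type>all|closed)-(?P<minsup>\d+)(?P<order>[A-Za-z]+)$"
)


def parse_isc_name(isc_name):
    """Split an iscName into database name, itemset type, minsup and order."""
    match = _ISC_PATTERN.match(isc_name)
    if match is None:
        raise ValueError(f"not a valid iscName: {isc_name!r}")
    return IscName(
        db_name=match["db"],
        itemset_type=match["type"],
        min_sup=int(match["minsup"]),
        candidate_order=match["order"],
    )


parsed = parse_isc_name("chess-all-2500d")
print(parsed.db_name, parsed.itemset_type, parsed.min_sup, parsed.candidate_order)
```

For chess-all-2500d this yields the database 'chess', all frequent itemsets as
candidates, an absolute minimum support of 2500, and the default candidate order 'd'.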


=== 3. The real thing: classification ===

- Go to the directory created for this experiment, as just mentioned.
- If everything went well, a new configuration file called 'classify.conf' has been
created in this directory. To do the actual classification, start Krimp with this file.
(This can be done in several ways, the easiest probably being 1) copying the file
to the directory where Krimp.exe resides or 2) associating .conf files with Krimp.exe
so that double-clicking the file is enough.)
- By default, two types of classification are done. In the configuration files, these 
two types are called 'code table matchings', but in the original paper we refer to
these as 'absolute pairing' and 'relative pairing'. See the paper for more details.
(Please note that for relative pairing, the reportSup in classifycompress.conf has to
be low enough. During the Krimp process, a code table is written to disk each time the
support has decreased by 'reportSup'. For small datasets, a reportSup of 10 would be
appropriate, whereas with larger databases this would result in many files being written.
Therefore, the default for this directive is 50.)
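
Why reportSup matters for relative pairing can be illustrated with a short Python
sketch. As section 4 notes, for each class the available absolute minsup closest to
the desired relative minsup is used. The sketch below is an illustration under stated
assumptions, not Krimp's implementation: it assumes that a relative minsup is a
percentage of the number of transactions in a class, and the function
closest_available_minsup as well as the example numbers are hypothetical.

```python
def closest_available_minsup(available_minsups, class_size, relative_minsup_pct):
    """Pick the available absolute minsup closest to the desired relative minsup.

    Assumption: the relative minsup is a percentage of the class size.
    """
    target = class_size * relative_minsup_pct / 100.0
    return min(available_minsups, key=lambda minsup: abs(minsup - target))


class_size = 120                     # hypothetical number of transactions in a class
coarse = list(range(100, 0, -50))    # code tables at supports 100, 50
fine = list(range(100, 0, -10))      # code tables at supports 100, 90, ..., 10

# A relative minsup of 20% corresponds to an absolute support of 24 here.
print(closest_available_minsup(coarse, class_size, 20))  # 50
print(closest_available_minsup(fine, class_size, 20))    # 20
```

With the coarser step, the closest available code table lies far from the requested
relative minsup, which is why a small dataset calls for a lower reportSup.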


=== 4. Inspecting your results ===

- All results are stored in the directory created for this particular experiment.
- In a result directory, you'll find 5 files containing results.
	* confusion-[abs|rel].txt
		A confusion matrix for the last code table pairing (lowest minsup/percentage),
		for both absolute (abs) and relative (rel) pairing.
	* classify-[abs|rel].csv
		Detailed results for both relative and absolute pairings, per fold and overall.
		Each row represents a combination (pairing, or matching) of code tables that is
		used for classification. In the absolute case, the left column gives the minsup;
		in the relative case, it gives a relative minsup. (Depending on the reportSup,
		not all code tables are available -- for each class, the absolute minsup that
		is closest to the desired relative minsup is used.)
		On the right, aggregated statistics are shown.
	* summary.csv
		Gives a summary of all results. It gives the best obtained result (highest 
		average accuracy over all folds) and the statistics of the last classification
		(lowest minsup).
		For easy aggregation into a larger file containing more results, all info is 
		also given in a single line.
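
The two figures reported in summary.csv can be reproduced from per-fold accuracies
along the following lines. This Python sketch is only an illustration: it does not
read Krimp's output files, the function summarise is hypothetical, and the accuracy
values are made-up example numbers, not real results.

```python
from statistics import mean


def summarise(accuracy_per_minsup):
    """Summarise classification results.

    accuracy_per_minsup maps a minsup to the list of accuracies per fold.
    Returns the best result (highest average accuracy over all folds) and
    the last result (lowest minsup), each as a (minsup, average) pair.
    """
    averages = {
        minsup: mean(folds) for minsup, folds in accuracy_per_minsup.items()
    }
    best_minsup = max(averages, key=averages.get)
    last_minsup = min(averages)
    return {
        "best": (best_minsup, averages[best_minsup]),
        "last": (last_minsup, averages[last_minsup]),
    }


# Made-up accuracies for two folds at three minsup levels.
example = {
    30: [0.80, 0.90],
    20: [0.90, 0.95],
    10: [0.85, 0.90],
}
summary = summarise(example)
print(summary["best"][0], summary["last"][0])  # 20 10
```

Note that the best result does not have to coincide with the last classification:
in this example the highest average accuracy is reached at minsup 20, not at the
lowest minsup.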

Reference:
- M. van Leeuwen, J. Vreeken & A. Siebes. Compression Picks Item Sets That Matter, Proc. 
Eur. Conf. on Principles and Practice of KDD, pp. 585-592, 2006.

Contact: 
- Matthijs van Leeuwen 	@ matthijs.vanleeuwen@cs.kuleuven.be
- Jilles Vreeken 		@ jilles.vreeken@ua.ac.be
